fix(core): use (type, id) keys in vector search hydration to prevent id collisions#986
Merged
Merged
Conversation
…id collisions Root cause: entity, observation, and relation rows in search_index carry ids from independent auto-increment sequences, so rows of different types routinely share the same numeric id (guaranteed in young databases). _search_vector_only parsed each vector hit's chunk_key (e.g. 'entity:4:0') but discarded the type, and _fetch_search_index_rows_by_ids keyed its result dict by bare row.id with no type discrimination. Whichever row the database returned last clobbered the other in the dict; the clobbered hit then hydrated against the wrong row or found None and was silently dropped from results. The FTS-filter branch already guarded this with (id, type) tuples, but the primary vector lookup path and the hybrid fusion maps missed the same treatment. Fix: introduce a SearchIndexKey = tuple[str, int] alias and key every map in the vector/hybrid retrieval path by (type, id) — the similarity and chunk maps in _search_vector_only, the _fetch_search_index_rows_by_ids result, the FTS-filter allowed keys, and the rows/fts/vec/fused score maps in _search_hybrid. The SQL stays unchanged; bare ids are deduped before the IN query and rows are discriminated by row.type when building dict keys. Tests: end-to-end SQLite regression test indexes an entity row and a relation row sharing id 7, syncs vectors for both, and asserts vector search returns both rows (and that the entity survives a search_item_types filter); a hybrid fusion unit test asserts an entity and relation sharing id 1 stay distinct with single-source scores. Both fail without the fix. Existing mocked vector tests updated for tuple keys. Fixes #982 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause
Entity, observation, and relation rows in
search_indexcarry ids from independent auto-increment sequences, so rows of different types routinely share the same numeric id — guaranteed in young databases. Insrc/basic_memory/repository/search_repository_base.py:_search_vector_onlyparsed each vector hit'schunk_key(e.g.entity:4:0) but discarded the type, keyingsimilarity_by_si_id/chunks_by_si_idby bare id, which also collapsed colliding hits into one map slot._fetch_search_index_rows_by_idsfetchedWHERE id IN (...)and keyed its result dict by barerow.id— whichever row the database returned last clobbered the other, so the clobbered hit hydrated against the wrong row or was silently dropped (e.g. an entity vanishing entirely under asearch_item_typesfilter when a relation shared its id)._search_hybridfused FTS and vector results on bare row id, merging unrelated rows of different types and granting them a spurious dual-source fusion bonus.The FTS-filter branch already guarded exactly this with
(id, type)tuples — the primary vector lookup path and hybrid fusion missed the same treatment.Fix
Introduce a
SearchIndexKey = tuple[str, int]type alias and key every map in the vector/hybrid retrieval path by(type, id): the similarity and chunk maps in_search_vector_only, the_fetch_search_index_rows_by_idsresult, the FTS-filter allowed-keys set, and the rows/fts/vec/fused score maps in_search_hybrid. SQL stays unchanged; bare ids are deduplicated before theINquery and rows are discriminated byrow.typewhen building dict keys. The fix lives in the shared base class, so both SQLite and Postgres backends are covered.Test evidence
test_sqlite_vector_search_survives_cross_type_id_collision(real SQLite + sqlite-vec harness): indexes an entity row and a relation row sharingid=7, syncs vectors for both, asserts vector retrieval returns both rows with correct types, and that asearch_item_types=[ENTITY]filter still returns the entity.test_cross_type_id_collision_keeps_both_results(hybrid fusion): an FTS entity and a vector relation sharingid=1stay distinct with their single-source scores — no cross-type merge or fusion bonus.test_vector_threshold.py,test_vector_pagination.py) updated for tuple-keyed fetch results.tests/repository/, search service, semantic search, and search schema suites;ruffandtyclean onsrc tests test-int.Fixes #982
🤖 Generated with Claude Code